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Abstract 

Recently, a fully implicit, energy- and charge- conserving particle-in-cell method has been 
proposed for multi-scale, full-/ kinetic simulations [G. Chen, et at, J. Comput. Phys. 230, 
18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, capable of 
using very large timesteps without loss of numerical stability or accuracy. A fundamental 
feature of the method is the segregation of particle-orbit computations from the field solver, 
while remaining fully self-consistent. This paper describes a very efficient, mixed-precision 
hybrid CPU-GPU implementation of the implicit PIC algorithm exploiting this feature. 
The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge- 
conserving, and adaptive particle mover is implemented on a GPU (graphics processing unit) 
using CUDA in single-precision (SP). Performance-oriented optimizations are introduced 
with the aid of the roofline model. The implicit particle mover algorithm is shown to achieve 
up to 400 GOp/s on a Nvidia GeForce GTX580. This corresponds to 25% absolute GPU 
efficiency against the peak theoretical performance, and is about 300 times faster than an 
equivalent serial CPU (Intel Xeon X5460) execution. For the test case chosen, the mixed- 
precision hybrid CPU-GPU solver is shown to over-perform the DP CPU-only serial version 
by a factor of ~ 100, without apparent loss of robustness or accuracy in a challenging long- 
timescale ion acoustic wave simulation. 

1. Introduction 

Particle-in-cell (PIC) methods were developed in the 1960's for the simulation of plasma 
systems with many particles interacting via electromagnetic fields. The technique employs 
the method of characteristics to follow discrete volumes (finite-size particles) in phase space, 
with fields defined on a discrete mesh in physical space. In a typical PIC timestep, particle 
orbits are integrated to find new positions and velocities for given fields. New fields are 
found by solving Maxwell's equations (or a subset thereof) using new moments (charge 
density and/or current) computed from particles. Interpolation operations are defined to 
exchange information between particles and mesh quantities. The method has been very 
successful in its application to many areas in plasma physics [1] and beyond. 

Due to the intrinsic data-parallel nature of the orbit integrals, PIC methods have been 
particularly successful in exploiting current (petascale) supercomputers [2|, yj. However, 
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with the current trend toward million-way parallelism, achieving high performance and high 
efficiency for PIC simulations on future supercomputers will be non-trivial. Future parallel 
computing systems will place strong constraints on algorithms both in the amount of memory 
available and in the cost of accessing it (memory operations are much slower than processor 
computations in modern computers, with the gap most likely enlarging in the near future 
Q). As a result, memory-bounded algorithms will be more challenged to utilize the hardware 
efficiently. Most current PIC time-stepping algorithms are explicit, with numerical stability 
constraints limiting work per particle with a single, small timestep. As a result, explicit PIC 
algorithms are typically memory-bounded, and thus are critically affected by the memory 
bottleneck. 

Nevertheless, there have been fairly successful efforts in porting explicit PIC algorithms 
to new computing architectures such as graphics processing units (GPUs) In a 

2D electrostatic PIC algorithm is implemented on a GPU. A particle data structure is pro- 
posed to match the GPU architecture so that the algorithm can be adapted to different 
problem configurations. Reference [6] describes a GPU implementation of a 2D fully rela- 
tivistic, electromagnetic PIC code, and introduces a strategy to improve performance of a 
charge-conserving current deposition scheme (which would otherwise require many condi- 
tional branches). Reference |7fl further describes an implementation of a 2D fully relativistic, 
electromagnetic PIC code on a GPU-cluster via domain decomposition, featuring MPI for 
inter-node communication. References (8 10 1 report efforts to improve the memory effi- 
ciency of particle-grid interpolations and to reduce memory usage in gyrokinetic codes. In 
Ref. [Hj, a speedup of about 2 is reported with a hybrid CPU-GPU simulation approach 
vs. a CPU-only one on tens of thousands of nodes. 

In this study, we focus on implementing a novel fully implicit PIC algorithm for electro- 
static simulation fl2| on a heterogeneous CPU-GPU architecture. Unlike explicit PIC, the 
fully implicit PIC algorithm does not feature a numerical stability timestep constraint. The 
method can solve either the coupled Vlasov-Ampere (VA) or Vlasov-Poisson (VP) system of 
equations, discretized to feature exact local charge and global energy conservation theorems. 
A Jacobian-free Newton-Krylov (JFNK) nonlinear solver is used to converge the nonlinear 
residual to a tight nonlinear tolerance. Crucial to the implicit PIC algorithm is the concept 
of particle enslavement, whereby the integration of particle orbits is an auxiliary computa- 
tion, segregated in the evaluation of the nonlinear residual. Particle enslavement has several 
advantages. Firstly, it results in much smaller residual vectors (and therefore in a much re- 
duced memory footprint) because only fields are dependent variables in the solver. Secondly, 
it affords one significant freedom to perform the particle orbit integration. In Ref. (l2| . this 
freedom was exploited to implement a self-adaptive, sub-stepping, charge-conserving particle 
mover. 

Despite the advantages of particle enslavement, the particle integration step remains the 
most expensive element in the fully implicit PIC algorithm. This is so because particle orbits 
need to be computed for every nonlinear residual evaluation, and many such residuals are 
computed as JFNK converges to a solution. This issue is exacerbated further by the ability 
of implicit PIC to use large timesteps, which requires many orbit integration sub-steps per 
timestep in the particle mover. 

Being the most time-consuming operation in the fully implicit PIC algorithm, the par- 
ticle orbit integration is the obvious target for hardware acceleration. Our implementation 
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exploits the flexibility afforded by particle enslavement, and targets the particle orbit inte- 
gration step for the GPU, while the field solver remains on the CPU. Contrary to most PIC 
algorithms, we show that the implicit particle mover is compute-bounded, and therefore has 
the potential of efficiently utilizing both the extreme multi-threading capability and the large 
operational throughput of the GPU architecture. Communication between CPU and GPU 
routines involves particle moments only (which are grid quantities), and not the particles 
themselves, thus minimizing the impact of memory bandwidth bottlenecks. 



We use the roofline analytical performance model|13| to help understand the performance 
limitations and bottlenecks of the algorithm and achieve high performance on the GPU. 



• Global memory operations are very expensive, and can severely limit the throughput 
of the simulation. 

• Not all arithmetic operations are equally fast. The slow operations, such as square 
root and division, can hinder the computational throughput. 

• CUDA employs a lockstep execution paradigm within a warp, which comprises 32 
threads. Divergent control flow is allowed, but results in performance degradation. 

• Memory collisions occur when many threads in parallel try to access the same memory 
location at the same time. Resolving them serializes the code and can become the 
bottleneck in parallel computations. 

To mitigate the impact of these constraints on GPU performance, we have implemented a 
series of thread-level and warp-level optimizations in the particle orbit computation without 
sacrificing accuracy. As a result of these optimizations, our implicit particle mover achieves 
up to 300-400 GOp/s for VA and VP, respectively, on an Nvidia GeForce GTX580, with the 
VP approach performing better on account of the memory-collision issue. The corresponding 
GPU efficiency is 20—25% of peak performance. The accuracy and performance of the overall 
hybrid CPU-GPU implicit PIC algorithm is demonstrated using a challenging, long-timescale 
ion acoustic wave (IAW) simulation. It is shown that a mixed-precision implementation, in 
which the CPU JFNK code uses double-precision and the GPU particle mover code uses 
single-precision, can be sufficient for accuracy. For the test case chosen, this setup results 
in speedups of the hybrid CPU-GPU algorithm vs. the CPU-only one up to 100. A defect- 
correction approach has also been implemented that enables the hybrid algorithm to deliver 
double-precision results. In this case, about a third of the GPU calls per time step are made 
in double precision, and the speedup drops to ~ 40. This should be compared to a factor 
of 25 speedup obtained when all GPU calls are made in double precision. These speedups 
are consistent with Amdahl's law, as the particle computation takes > 98% of the overall 
computation time for the test case chosen. 

The rest of the paper is organized as follows. Section [2] introduces the Nvidia GPU Fermi 
architecture and the roofline model. Section [3] describes the specific GPU optimizations 
introduced in the adaptive, charge-conserving particle mover. Section H] shows the perfor- 
mance and efficiency results of numerical experiments, including the complete IAW test case 
integrated with the JFNK solver. Finally, we provide some discussion and conclusions in 
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2. An analytical performance model for GPU computing 

As modern computer architectures shift from single- to multi-core or many-core proces- 
sors, parallel programs must be able to exploit increasingly large concurrency efficiently 
This, in turn, places strong emphasis on identifying performance bottlenecks and sources of 
latencies. This task is facilitated when programmers have some basic understanding of the 
underlying hardware, such as the memory hierarchy and the processor computing capabil- 
ities, and target the optimizations for that hardware. In this study, we focus on the GPU 
architecture, which we introduce next. 

2.1. Nvidia GPU architecture 

Contemporary Nvidia GPUs[l9| are capable of performing scientific computations pro- 
grammed in high-level languages such as CUDA C/C++ and CUDA Fortran, with high 
accuracy (supporting IEEE standards) and high performance (theoretical throughput over 
trillion floating-point operations per second or TFLOPS). GPUs consist of many processing 
units. For instance, the newest Nvidia GPU to date, named Fermi, has up to 16 stream- 
ing multi-processors (SM). Each SM contains 32 processors, or CUDA cores, which perform 
floating-point, integer, and logic operations. SMs also contain 4 special function units (SFU), 
which calculate fast floating-point approximations to certain complex operations such as re- 
ciprocal, reciprocal square root, etc. 

GPUs also contain its own memory system. From slow to fast, one finds off-chip global 
memory, on-chip shared memory, and on-chip register files. In addition, Fermi GPUs are 
equipped with a two-level (LI and L2) read/write cache hierarchy. Specifically, Fermi GPUs 
contains 2 to 6 GB global memory and 768 KB L2 cache; each SM contains 64 KB con- 
figurable shared memory/Ll cache, and 128 KB registers. A unique feature of the GPU 
memory system is that all levels (except for the L2 cache) can be explicitly managed by the 
programmer. 

In order to gain insight into the maximum performance and efficiency of a given algo- 
rithm running on a GPU, we adopt an analytical performance model, the so called roofline 
model[l3|. We proceed to introduce the roofline model and its application for the Nvidia 
GPU architecture. In the sequel, we assume that all computational operations are on 32-bit 
words (e.g., single-precision) unless otherwise specified. 

2.2. Roofline model 

The roofline model is motivated by the fact that the bandwidth of current computer 
off-chip memory is often much slower than the throughput of the processing unitj2o|. For 
instance, the Nvidia GeForce GTX580 GPU has a DRAM bandwidth of 192 GB/s, whereas 
its peak floating-point operational throughput is 1581 GFLOPS. This large discrepancy 
makes identifying whether the program is memory-bounded or compute-bounded critical to 
target optimizations. If memory-bounded, the program should maximize the use of fast 
memory; if compute-bounded, it should minimize the number of operations. 

The roofline model provides a simple method to determine whether an algorithm is either 
memory-bounded or compute-bounded. The key figure of merit is the operational intensity 
(01), defined as the ratio of compute operations to memory operations. An algorithm is 
compute-bounded for 01 higher than the balanced 01, and is memory-bounded otherwise. 
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The balanced 01 of the target device is defined as the ratio of peak-operational throughput 
to memory bandwidth (which is about 8 FLOP/B for the Nvidia GeForce GTX580 GPU). 

An algorithm's compute efficiency is commonly defined as the ratio of its FLOPS vs. the 
peak theoretical compute performance. On a Nvidia GPU, the latter is calculated as 

(number of cores) x 2Flop/CC x (clockrate), (1) 

where the factor of 2 comes from the fused multiply-add (FMA) operation, which computes 
one AXPY operation (i.e., a x x + y) per clock cycle (CC). On the GeForce GTX580 GPU, 
number of cores = 512, and clockrate = 1.544 GHz, resulting in 1.58 TFLOPS. Note that, 
by definition, the peak theoretical performance may be reached by algorithms based on FMA 
operations only. However, most algorithms in scientific computing mix floating-point, inte- 
ger, and logic operations. Such algorithms cannot reach the peak theoretical performance, 
no matter how well optimized. Therefore, for a given algorithm, it is useful to define an 
intrinsic efficiency based on its specific operations. 

We define the theoretical operational throughput as the maximum theoretical performance 
of a compute-bounded algorithm. It can be calculated as 

(number of SM) x (average operational throughput) x (clockrate), (2) 

where 

(average operational throughput) = (operations) / YJ( clock cycles) (3) 

is the average number of operations per clock cycle per stream multi-processor. The theoreti- 
cal operational throughput assumes that all memory and instruction latencies are completely 
hidden or negligible, without performance overhead of any kind. The intrinsic efficiency is 
defined as the ratio of the actual operational throughput vs. the theoretical one (Eq. [2]). It 
indicates an algorithm's real effectiveness in using a given target hardware. 

In what follows, we compare the operational throughput of several basic operations, which 
are the building blocks of our particle mover algorithm, on the Nvidia Geforce GTX580 GPU 
in the context of the roofline model. 



2.3. Operational throughput 

Each CUDA SM of the Nvidia GPU Fermi GF100 architecture H has 32 floating-point 
units (FPUs), 32 integer arithmetic logic units (ALUs), and 4 SFUs, capable of processing 
different types of computations simultaneously. Depending on the hardware implementa- 



tions, the throughput of different computational operations varies 17). To illustrate this, we 
have created a simple CUDA code to micro-benchmark (similar to those used in Ref.j22l. |23|) 
the throughput of some basic floating-point and integer operations. In these tests, many iden- 
tical operations (using unrolled loops) are performed by each thread, and many concurrent 
threads (with 100% occupancy) are employed. The test code is compiled with nvcc v4.0 
compiler. The assembly code generated by the PTX (or cuobjdump) tool is examined to 
ensure that instructions are executed as intended. Results are shown in FigJU from which 
we can make the following observations: 

• As expected, FMA reaches the peak theoretical performance for large OIs. 
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Figure 1: Roofline model for Nvidia GeForce GTX580 on a log 2 -log 2 scale. The performance lines are 
obtained by artificial algorithms which repeat a single operation multiple times. Saturation of the curves 
shows the maximum operational throughput in the compute-bounded regime for a given operation. In the 

plot, sqrt, /, rsqrt, and fdividef stand for IEEE compliant square root, IEEE compliant division, intrinsic 

reciprocal square root, and intrinsic fast division, respectively. 
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Table 1: Throughput of floating point (FP), arithmetic and logic (AL), and special function (SF) operations 
on GeForce GTX 580. Throughput is measured as operations per clock cycle per multiprocessor. The 
compiler nvcc v4.0 is used to produce the results. 
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• The peak performance of integer operations (multiply-add) is at best half of the peak 
theoretical performance. In other words, integer operations are slower than floating- 
point operations. 

• Standard (IEEE compliant) square root (sqrt) and division(/) operations are one to 
two orders of magnitude slower than simpler floating-point or integer operations. 

• The throughput of intrinsic reciprocal square root (rsqrt), or intrinsic fast division 

in single-precision ( fdividef), hardware-implemented in SFUs, is exactly 16 times 

lower than the peak theoretical performance. This is expected, as it equals the ratio of 
the number of FPU (32) and SFU(4) multiplied by 2 (operations per FMA). Intrinsic 
operations are much faster than IEEE versions at the cost of being less accurate. We 
will show in later sections that, under some situations, fast intrinsic functions can be 
accompanied by a few FMA operations to improve accuracy. The hybrid use of SFUs 
and FPUs has the advantage of exploiting both units concurrently, and therefore offers 
the potential to achieve sufficient accuracy with enhanced performance. 

Table [T] lists the maximum throughput of the operations on the GPU found by the micro- 
benchmarks. With the throughput of each basic operation known, we can now calculate the 
average operational throughput for a particular algorithm [Eq.Q], with the total number of 
clock cycles required to execute those operations calculated as Yli-^i x 32/OTj. Here, Ni is 
the number of operations of type i, and OTi is the operational throughput from Table [TJ 

We proceed to analyze in detail the GPU implementation of the implicit particle mover 
of Ref.0. 

3. GPU implementation and performance analysis of the adaptive, charge-conserving 
particle mover 

We pursue the development of a hybrid CPU-GPU algorithm in which the particle push 
is farmed off to the GPU, while the nonlinear solver remains on the CPU. In our implementa- 
tion, particles reside in the GPU global memory, and no particle transfer is needed between 
CPU and GPU for the field solver. Only grid quantities (i.e., the electric field and moments 
collected from the particles) are transferred between the two architectures. Nonlinear iter- 
ations are performed on the CPU until nonlinear convergence is achieved (convergence is 
enforced on field quantities, with particle quantities obtained self-consistently). The process 
is repeated every timestep. 

This section focuses on the implicit particle mover (l2| . which features three main prop- 
erties: self-adaptivity, automatic local charge conservation, and absolute numerical stability. 
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Self-adaptivity is achieved by particle sub-stepping, with the sub-timestep controlling the 
discretization errors of the orbit integral. In each sub-step, particles are pushed along a 
straight line. Local charge conservation is automatically enforced by ensuring particles land 
at cell boundaries during the orbit integration process. A Crank-Nicolson (CN) integration 
scheme, which is time-centered and implicit, is employed to guarantee numerical stability for 
arbitrary timesteps. 

It is well known that contemporary GPUs are capable of launching a large number of 
threads (tens of thousands on current-generation devices) in a SPMD (single program, mul- 
tiple data) style. They are very attractive for particle simulations, as particle orbits are 
independent of each other. However, the highly dynamic nature of the implicit mover may 
prevent the algorithm from achieving high performance and efficiency on the GPU. To begin 
with, the simulated systems are typically highly nonlinear and inhomogeneous. As a result, 
threads pushing particles with different positions in phase space will experience very differ- 
ent workloads. When large timesteps are employed, each particle follows an orbit according 
to local conditions, undergoing an indefinite number of sub-timesteps and cell-crossings. 
Similarly, an iterative solution of the coupled Crank-Nicolson equations introduces unpre- 
dictability in the algorithm, for the number of iterations for convergence is unknown and 
particle-dependent. The resulting unpredictable logical paths (per thread) create divergent 
branches, serializing the executions in a given warp (32 threads). Non-divergent branches 
may also be created, with some idle threads waiting for others to finish. It follows that it 
may be very difficult to keep all threads busy, resulting in parallel performance degradation. 

Additional performance degradation can result from inefficient memory operations. One 
prime example is parallel scatter-gather operations, such as field interpolations to particles 
and moment integration from particles, which can significantly slow down the simulation. 
Random access to the field may have a large performance penalty, for instance, if the LI 
cache miss-rate is high. Moment accumulation may hinder the computation due to the cost 
of resolving memory collisions. This occurs when two or more threads try to write the same 
memory location simultaneously. Ensuring correctness requires using atomic operations, 
which serialize the accumulations. 

Despite these challenges, we demonstrate in the following sections that the implicit par- 
ticle mover algorithm can in fact be relatively efficient on the GPU. With the aid of the 
roofline model, we identify that the implicit particle mover is compute-bounded, and there- 
fore significant performance improvement can be achieved through targeted optimizations. 
We motivate specific optimizations with a baseline GPU implementation of the mover (de- 
scribed below), aiming at minimizing clock cycles (per thread) without accuracy degradation. 
We also describe warp-level optimizations, including our particle-sorting strategy and the use 
of a warp vote function, to minimize load imbalance and control-flow latencies. 

Before getting into the detailed optimizations of the mover, we introduce its basic memory 
management. There are mainly two groups of memory operations. The first group involves 
reading particle quantities {v p ,x p ,i p } (denoting velocity, position, and cell index, respectively) 
at the timestep n from global GPU memory, and writing the updated quantities at the 
timestep (n+1) to global GPU memory. Since the particle-orbit calculations are independent 
of each other, the load and store of the particle quantities, which we group in a structure- 
of-array fashion, are trivially coalesced (making stride-one access of the global memory for 
optimal performance[18]). The second group involves reading/writing grid quantities, such 
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as the electric field E, current density j, or charge density p. The E field is read-only; either 
LI cache or texture memory can be exploited to accelerate repeated readings. Our approach 
to the accumulation of moments (such as j or p), which is often found to be the bottleneck 
of many PIC implementations @, Isl, 1Q|, is discussed in detail below (Sec. 13. 4p . 



To frame the subsequent discussion, we divide each particle sub-step into four algorithmic 
elements: 

1. Estimate the sub-timestep At. 

2. Integrate the orbit over At using a Crank-Nicolson scheme. 

3. If a particle crosses a cell boundary, make it land at the boundary. 

4. Accumulate the current density on the grid points (for VA, but not for VP). 

This process is repeated over many sub-steps until the end of the timestep, at which point we 
either take the orbit average of the current density (for VA), or accumulate the charge density 
(for VP). In the following, we describe the baseline algorithm and targeted optimizations for 
each element. 

3.1. Algorithmic element j^l: estimate of the sub-timestep 

The first algorithmic element employs a standard local-error-control method[24j to esti- 
mate the sub-timestep, by taking the difference between the truncation errors of the Euler's 
scheme (a first-order method) and Heun's scheme (a second-order method) to be smaller 
than a specified tolerance. The resulting formula reads (l2| 

||Ze(Ar)|| 2 <£ a + £ r ||r°(Ar)|| 2 , (4) 

where ||-|| 2 denotes the L2-norm of enclosed vector, le(Ar) = {a^, (^-v^j } is the 

local truncation error of the sub-timestep is, e a and e r are absolute and relative tolerances, 
respectively, and r°(Ar) = {v"oZ}At is the initial residual. For da p /dx, we take 

dap/dx^^Ei-Ei^/Ax, (5) 
m 

when the particle is in cell i (or between grid points % — \ and %) at time level v. This is exact 
for linear interpolations. It follows that the estimate can be found by solving a quadratic 
equation for At. 

A direct implementation of the above method would require about 33 floating-point 
operations, 6 special function operations, and 1 division operation. To optimize the method, 
we replace the L2-norm with the Li-norm, e.g., , afc}^ = + I^Ij which suffices for 
the error estimate without requiring a square root. The equation for At [from Eq.Q] can 
be written as 

aAr 2 - At - 7 2 = 0, (6) 

where a = Qa^| + (^rv p ^ j/2, ft = e r (|a£| + \v%\), 7 2 = e a . To optimize the com- 
putation further, we avoid solving the quadratic equation by noting that it is sufficient to 
estimate At from the absolute and relative tolerances separately: 

aAT 2 = 7 2 , (7) 
«Ar = (3, (8) 



9 



and then take At = max^d, (3d 2 ), with d = a~^ 2 computed via the intrinsic rsqrt function 
only once. 

As a result of these optimizations, the estimate of sub-timestep requires 30 floating-point 
operations (including 6 FMAs) and 1 special-function operation. The optimized algorithm 
has reduced the per-thread clock cycles from 106 to 28, largely achieved by replacing three 
square roots and one division with just one reciprocal square-root operation. 

3.2. Algorithmic element j^2: Crank- Nicols on particle move 

The second element of the mover is a CN step using the estimated sub-timestep At: 

. p = < +1/2 , (9) 
At v p 

—K ~ = < + V2 > (10) 

At" p 

where 1 /2 denotes a mid-point average, i.e., Vp +1 ^ 2 = (v" + ify jl. These equations are im- 
plicit and coupled (but not stiff), and are generally solved by a fixed-point iterative method, 
e.g., Picard's method. 

It turns out that, when employing first-order B-spline interpolations, Eq.Q and (fTUj) can 
be solved directly as (omitting the subscript p) 



v 



o v At v + 

v+1 _ 



1 _L f da (At) 
L ^ 1 dx 4 



da (At) 2 
dx 4 



where da/dx is given by Eq.(jSD. The division in Eq. (llll) can be replaced by the intrinsic 
reciprocal function without loss of accuracy, as follows. We write v u+l = NR = N x 
RCP(-D) where N, D stand for the numerator and denominator in Eq. (TTTj) . respectively, 
and R = KCP(D) stands for the fast intrinsic reciprocal function of CUDA. Note that the 
maximum ULP (unit in the last place) error of the fast reciprocal is about l|2oT| (only slightly 
lower than IEEE required 0.5 ULP precision). This is precise to the 8th significant digit, 
as the last bit of the single precision number corresponds to 5.96 • 10~ 8 . To reduce the 
error further, we take one more iteration of Newton-Raphson's method such that v u+l = 
NR{2- DR)^. 

The optimization introduced here involves two steps: it first eliminates the control-flow 
condition needed in the convergence test of the nonlinear iteration, and then replaces the 
division by the intrinsic fast reciprocal function. In the latter step, one extra step of Newton- 
Raphson is taken to ensure sufficient precision. The optimization slightly increases the 
number of operations from 11 to 14, but the number of clock cycles of this step is significantly 
reduced from 43 to 22. 

3.3. Algorithmic element j^3: Particle cell-crossing 

The third algorithmic element checks whether the particle has moved into another cell 
after the CN move. If a crossing occurs, the particle is forced to stop at the cell boundary. 
The corresponding sub-timestep is found by solving the quadratic equation: 

a U+l/2 

F(At) = At 2 + v v At - Ax v = 0, 
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which is obtained by fixing the final particle position at the boundary in question, and 
combining Eq. and (ITUj) . The corresponding particle velocity can be found according to 
the energy principle to be: 



v »+ l = S ng(Ax 1/ ) yj (v v ) + 2a v+1 l 2 Ax u , (12) 

where sng(Ax iy ) returns the sign of Ax v {= x v+1 — x v ), which signals the direction of particle 
motion. 

The above treatment requires a square root and a division, which can be optimized by 
the following (inexact) Newton's method. The solution is first approximated by 



-v" + sng(Ax u )J (<) 2 + 2ap +1/2 A^ 
Ato = ^XTT^ (13) 

using fast intrinsic functions rsqrt and fdividef. Subsequent iterations are performed as: 

Ar k = Ar fc _! - F(Ar fe _ 1 )/F'(AT fc _ 1 ), (14) 

where F'(Arf.-i) = a u+1 ^ 2 Ar k _i + v u is the Jacobian. A fast division can be applied to update 
Eq. (I14p . This is a safe approximation because the convergence of Newton's method is robust 
against small errors in the Jacobian[27]. Note that each of the applied intrinsic functions has 



a maximum error of 2 ULP 17], which provides excellent initial value for Newton iterations. 
Because of the quadratic convergence rate of Newton's method (i.e., the correct digits double 
for every iteration), this last step ensures that the solution is accurate to the last digit of a 
single-precision number. 

Overall, we have replaced a square root and a division by a fast reciprocal square root 
and two fast divisions. Consequently, the number of clock cycles for particles moving inside 
the cell is reduced from 74 to 47. 

3.4- Algorithmic element j^4 : particle current/ charge accumulation 

In the VA approach, the fourth element employs a standard interpolation procedure to 
accumulate the current density on the grid points from particles: 

i = ^£^< +1/2 ^-*p +1/2 )> 
p 

where Ax is the cell size, and S is the shape function. This is done for every sub-timestep. 
In the VP approach, the charge density is collected from each particle as: 

p = ^J2^- x p +1 )- 

p 

This is done only at the end of the orbit computation. In the baseline implementation, 
all the accumulations are performed on shared memory, using the floating-point atomicAdd 
function, and the final results are written back to global memory at the end of the timestep. 

It is desirable to avoid memory collisions as much as possible in the accumulation process. 
On one hand, memory collisions are completely avoided when each thread accesses its own 
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copy of the physical domain in memory, but this is very demanding memory-size-wise. On the 
other hand, memory-size requirements are minimized when all threads (of one thread block) 
access the same memory domain, but this results in frequent memory collisions. To balance 
best efficiency of the accumulation and limited resources of shared memory, we provide each 



warp with its own memory domain 28J . This avoids memory collisions between threads of 
different warps. Collisions within a warp are resolved by the shared-memory, floating-point 
atomicAdd function. 

A second optimization (for VA only) replaces the shared-memory accumulations of local 
current density in a given cell by register accumulations. This optimization exploits the fact 
that register access is faster than shared memory and collisions with register accumula- 
tions are absent. Specifically, each thread employs two local register variables (one per cell 
face) to accumulate current density as the particle sub-steps within a cell. Register values 
are atomically added to shared memory and reset to zero only when the particle crosses a 
cell boundary. 

Acceleration has been achieved in the moment accumulation by reducing memory col- 
lisions (for both VA and VP), and by reducing the usage of atomic operations (for VA). 
Quantitative results are presented in the following sections. 

3.5. Roofline analysis 

We proceed to show that the implicit particle mover algorithm is compute-bounded. The 
analysis is carried out for VA for brevity. A similar analysis, with similar conclusions, applies 
to VP. 

The particle mover algorithm requires two (read/write) single-precision memory trans- 
fer operations from global memory of three particle quantities {x,v,i} per particle orbit 
integration (recall that the electric field is read-only, and is cached for fast access). The 
corresponding (global) memory throughput is 2x3x4 = 24 B per particle. For the sake 
of argument, we assume 20 sub-steps and 2 cell-crossings in a typical particle orbit per 
timestep, which results in 1762 (1410) computational operations per particle in the baseline 
(optimized) algorithm. It follows that the corresponding operational intensity is: 

Ol^ll ! ba f ne l (Op/B). 
I 59 [optimized] 

This is 7 to 9 times larger than the balanced 01 (see Sec. 12. 2 p . which confirms that the 
implicit particle mover algorithm is compute-bounded. 

3.6. Particle sorting strategy 

It is often beneficial to sort particles, as the memory operations are more efficient with 
improved data reuse. Previous studiesji], @, [lo|, 29| have focused on explicit PIC algorithms. 



Explicit schemes employ small timesteps, so that only a small fraction of the particles that 
cross cells (or sub-domains) need sorting. The goal is to keep the memory space consistent 
with the physical space such that particles that are physically close are also co-located in 
memory. Those particles moving across cells (or sub-domains) need to be rearranged in the 
particle array to preserve memory collocation. The complexity of this process is 
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where N is the total number of particles and 77 is fraction of the particles crossing cell (or sub- 
domain) boundaries. This approach is appropriate for explicit schemes, where the timestep 
is very small. 

Explicit PIC sorting strategies, however, are not suitable for implicit PIC when a large 
timestep is used. In this case, many particles may cross cells in an implicit timestep, thus 
increasing the overhead of shuffling particles in memory and resulting in frequent memory 
collisions. In addition, particles with different velocities require different amount of work 
per orbit integration, which causes load imbalances and divergent branches. Hence, in an 
implicit PIC context, particle sorting should focus on improving thread load-balancing and 
on minimizing memory collisions. We address the former by sorting particles such that 
each warp deals with particles with similar velocities (in sign and magnitude). The latter is 
addressed by having each thread within a warp integrate a particle located in a separate cell 
in physical space. 

Our particle sorting approach features a two-pass implementation. In a first pass, we 
divide the 2D phase space (x — v) into rectangular cells using a Cartesian grid. We label 
each cell with an integer number (starting from zero) via lexicographic ordering along the 
physical coordinate (rows). Assuming that the physical domain is discretized with N g cells, 
the first row in phase space (corresponding to the same velocity interval) will be labeled with 
numbers 0,l,...,iV 9 — 1. The second row (corresponding to the next velocity interval) will be 
labeled with numbers N g ,...,2N g — 1, and so on. Next, we label particles in each cell with 
the corresponding cell number. We then collect the total number of particles within each 
cell, and use a prefix sum (3o| to aggregate particles in previous cells lexicographically. This 
gives: 

w« = X>;> (is) 

j=0 

where Nj is the number of particles in cell j, and AT i0 is the total number of particles up 
to (but excluding) cell i. Here, i,j are lexicographic indexes, and N 00 = 0. In a second 
pass, we sort particles according to their velocities, and we place the particles into a ID 
array in which each continuous and aligned N g particles belong to consecutive N g cells in 
physical space (x), and also have similar velocities. The key in this second pass is to reserve 
the particle locations in the ID array from information collected during the first pass. In 
particular, Nio provides the memory address (starting from index zero) of the ID particle 
array for the first particle in cell i. Additional particles in the same cell with particle index 
p G [1, Nj — 1] are placed in the ID array with memory address N i0 + N g x p. 

When particles are sorted in this way, memory collisions within a warp are avoided as long 
as particles do not cross cell boundaries. Furthermore, the load balance improves because 
particles with similar velocities will have similar orbit lengths, thus improving the efficiency 
of the algorithm at the warp level. This approach needs two copies of the particle array 
(corresponding to timesteps n and n + 1), and the computational complexity of the sorting 
scheme scales with the number of particles. However, only a single sorting step is required 
per implicit timestep, making the overall overhead manageable. 
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3. 7. Control-flow optimization 

The implicit particle mover uses control flows extensively to ensure an accurate orbit 
integration with large timesteps. In this case, performance may degrade on the GPU because 
warp execution results in branches, which are executed sequentially. 

Branches are created, for instance, when a particle crosses a cell boundary, or a particle 
orbit terminates, or memory accumulations collide, etc. Some optimizations introduced in 
the preceding sections already address branching and load balancing, e.g., by solving the 
Crank-Nicholson equations directly (see Sec. I3.2[) . and by reducing memory collisions (see 
Sec. 13.41) . Particle cell-crossings introduce branches because they happen randomly in a given 
thread, thus forcing other threads in the same warp to wait. The quantitative performance 
impact of particle cell-crossing is shown in the next section. 

Branches also appear when particle orbits within a warp do not terminate simultaneously. 
Their impact is ameliorated in our implementation by using the vote function with all 
reduction mode [l7| as the exit condition for a given warp. The vote. all function returns 
true when the exit condition is satisfied for all the threads in a warp, and it can be used 
to break an infinite while-loop (which does not create conditional branches). We have found 
that the warp vote function is more effective for VP than for VA. This is most likely due to 
the fact that increased atomic operations of the VA approach hide load-imbalance latencies. 

4. Numerical experiments 

In this section, we first conduct several numerical experiments to characterize the base- 
line and optimized particle orbit integration algorithms described above. We measure their 
operational throughput on the GPU, and demonstrate the significant performance and effi- 
ciency gains of the proposed optimizations, resulting in a 50-70% intrinsic efficiency (vs. the 
application maximum operational throughput) for VA and VP, respectively, with a corre- 
sponding 20-25% overall efficiency (vs. peak throughput, 1581 GOp/s). We also study their 
performance sensitivity to various external parameters such as the number of threads and 
the timestep size, and conclude that performance is generally robust except for the timestep 
size (which directly affects the 01 of the algorithm). We compare the performance of the 
particle pusher algorithm running on both a CPU (in single precision) and a GPU (in single 
precision), and demonstrate significant speedups (200 — 300). Finally, we integrate the GPU 
particle mover with the full nonlinear solver [12] on a CPU, and test accuracy and wall-clock 
performance for a challenging ion acoustic wave simulation. We demonstrate that a mixed- 
precision hybrid CPU-GPU implementation (with the GPU running in single precision and 
the CPU in double precision) is sufficient from an accuracy standpoint, and that it is able 
to deliver very large wall clock speedups vs. the double-precision CPU-only implementation. 
In what follows, code running on the GPU is in single-precision unless otherwise specified. 

4-1. GPU performance of the implicit particle mover 

The GPU performance of the algorithm is best understood by comparing the theoretical 
operational throughput to that of a real execution. To compute the theoretical operational 
throughput, we adopt operations per second (Op/s), instead of FLOPS, as the figure of 
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Table 2: Breakdown of operations and clock cycles (per thread per sub-step) of the implicit particle mover 
algorithm. In the table, the left number counts operations, and the right one counts clock cycles. 



OP/CC 


FP add,mul, 
fma 


AL add, mul, 
logic, cvt 


SF rsqrt, 
fdividef 


division 


total 


baseline 


64.2 / 48.0 


31.8 / 63.6 


6.2 / 49.6 


2.1 / 67.2 


104 / 228.4 


optimized 


60.1 / 43.5 


26.6 / 53.2 


2.2 / 17.6 


0/0 


88.9 / 114.3 



merit for performance evaluations. This is because floating-point, integer, logic, and special- 
function operations play important roles in our algorithm. When computing the operational 
throughput, we need count all operations, especially the slower ones. 

In the simulations, we set e a = 10~ 8 and e r = 0.02 to be the absolute and relative toler- 
ances of the orbit sub-time-step estimation, respectively (see Sec. 13. ip . The corresponding 
average number of cell-crossings is about 10% of the number of sub-steps. Table [2] lists the 
theoretical number of arithmetic operations and their respective clock cycles for the baseline 
and optimized algorithms. We see that, even though the total number of operations does 
not change much from the baseline to the optimized version, there is a dramatic change in 
the number of clock cycles. By examining the operational throughput of each category, we 
see that the baseline algorithm spends most of the time computing only a small number of 
special functions and divisions, whereas the optimized algorithm significantly reduces the 
use of those operations. As a result, the theoretical performance nearly doubles after the 
optimization. 

Figure [2] shows the operational throughput of the algorithms running on the GPU as 
a function of the number of threads. We see that perfect linear scaling continues beyond 
the physical number of CUDA cores, and performance only saturates when the number of 
threads exceeds the number of cores by a factor of about 20. This is consistent with Little's 
law(3l| . which predicts that the number of threads needed for maximum performance is equal 
to the number of cores multiplied by the instruction latency (~ 18 CC on Fermi GPUsjl8|). 
Achieving the maximum performance of Little's law is possible when 1) latencies, including 
memory-level and instruction-level ones, are hidden by exploiting extreme concurrency with 
many threads, and 2) the architecture allows very fast switching between threads (GPUs 
can switch in one CC). 

As implemented, the measured performance of the optimized algorithm (for VA) is 320 
GOp/s, which corresponds to a 50% intrinsic efficiency and a 20% peak efficiency This is to 
be compared with 130 GOp/s of the baseline algorithm (8% of peak). For the VP approach, 
we get 380 GOp/s (70% intrinsic efficiency, 24% peak efficiency). The performance increase 
of VP vs. VA can be traced to the much larger number of accumulations required by VA. 

Figure [3] shows the run-time breakdown of the algorithm before and after the optimiza- 
tions. The cost of global memory operations is negligible compared to other operations, 
confirming that the algorithm is compute-bounded. The most time-consuming parts are the 
time estimator and the Crank-Nicolson mover. They achieve a significant speedup, a result of 
both thread-level (through modifications of the algorithm) and warp-level (through particle 
sorting and vote function) optimizations. The optimizations in the cell-crossing algorithmic 
element are also effective. 

Figure H] shows the sensitivity of the performance and efficiency of the optimized particle 
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Figure 2: Scaling of implicit particle mover (optimized for VA). The dotted line indicates the number of 
parallel CUDA cores available on the GPU. The peak performance reaches 320 GOp/s. 
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Figure 3: Run-time breakdown of the mover algorithm for the VA approach. The load (LD) and store (ST) of 
1,048,576 particle quantities on the global memory take 0.2 ms. The VP algorithm features similar timings, 
except for the atomic-accumulation step (which becomes negligible). 
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Figure 4: Performance and efficiency of the mover algorithm in single precision on the GPU under conditions 
varying the timestep (At), the field amplitude (\E\), and the number of particles (N p ). For large timesteps, 
the performance is insensitive to the field strength or the particle number. 
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Figure 5: CPU and GPU run time comparison for the baseline and optimized mover algorithm of VA and 
VP approaches (log scale). The single-precision GPU versions are two orders of magnitude faster than the 
respective single-precision CPU versions. 



mover algorithm for both the VA and VP approaches with respect to timestep, field strength, 
and number of particles. Clearly, while the performance is insensitive to changes in the 
field strength and the particles number, it is quite sensitive to changes in the timestep, 
improving with increasing At. For a timestep of 0.1 (typical in explicit PIC simulations), the 
algorithm is close to a memory-bounded regime, and the performance is low. As the timestep 
increases from 1 to 10 or larger (typical in implicit PIC simulations jl2j| ) , the algorithm 
becomes compute-bounded, and both the VA and the VP recover good performance as well 
as efficiency. 

4-2. Performance comparison between the CPU and GPU implementations of the particle 
mover algorithm 

For compute-bounded algorithms, a CPU-GPU performance comparison is informative 
when the speedup is measured against the theoretical peak performance speedup. For a 
single Intel Xeon CPU X5460 core at 3.16 GHz (used in this work), the peak theoretical 
performance for executing 2 operations on 4 single-precision variables in the SIMD (single 
instruction, multiple data) style per clock cycle is: 

2SIMDOp/CC x 4 value/Op x clockrate = 25.2GOp/s. 

Therefore, assuming the same efficiency on both architectures, the nominal GPU-to-CPU 
speedup is around 60 (~ 1581/25.2). 
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Actual CPU vs. GPU timing comparisons are depicted in Fig. [51 We have programmed 
the CPU code with standard C/C++, without implementing any explicit Intel SSE SIMD 
instructions, or multi-threading using multi-cores. We have relied on the compiler (Intel 
C/C++ compiler version 12.0.4) to optimize automatically. We first compare the perfor- 
mance of the baseline algorithm between the CPU and GPU (this case is identical to the 
one shown in Fig. [3]). The codes are very similar on both processors, except for fast mem- 
ory management, which is explicit on the GPU (Sec.(j3J)). We find that the absolute GPU 
efficiency (~ 8%) is larger than that of the CPU code (~ 5%). The speedup scores 100. The 
increased speedup (100 vs 60) is consistent with the increase in efficiency (8% vs 5%). 

After the optimizations, the speedup reaches about 200 and 300 for the VA and VP 
approaches, respectively, corresponding to a GPU efficiency of 20-25%. We note that some 
of the modifications introduced in Secj3] (such as the optimized sub-timestep estimator) have 
also been used to improve the performance on the CPU. 

4-3. Performance of the hybrid, mixed-precision CPU-GPU fully implicit PIC solver 

We proceed to compare the performance of a hybrid, mixed-precision CPU-GPU imple- 
mentation of the fully implicit PIC algorithm vs. the CPU-only serial implementation in 
Ref. |12j (but incorporating applicable algorithmic optimizations developed in this study). 
We follow this reference and use the ion acoustic wave (IAW) case for our numerical tests. 
As set up, the IAW features large-amplitude waves that can propagate in an unmagnetized, 
collisionless plasma without significant damping. 

Figure [6] shows VA simulation results from both the hybrid implementation and the CPU- 
only one. We depict the ion and electron kinetic energy, as well as conserved quantities (local 
charge, total momentum, and total energy). The relative nonlinear tolerance of the JFNK 
solver is 2 x 10~ 4 for the hybrid CPU-GPU version, and 10~ 10 for the CPU-only version. The 
nonlinear tolerance is larger in the GPU version to prevent stalling of the nonlinear iteration 
due to lack of numerical precision in the calculation of the current density. As a result, 
the average number of Newton iterations per timestep is about 5 for the single precision 
implementation, vs. 8 for the double-precision one. 

Figure M demonstrates very good agreement between the two implementations over a 
very long simulation span (100 IAW periods or 8000 plasma periods). Conservation of 
energy is enforced to the nonlinear tolerance level, i.e., 10~ 12 and 10 -6 for double- and 
single-precision simulations, respectively. The total momentum is not conserved exactly 
in either implementation, but it remains bounded and fluctuates at similar levels in both 
simulations. The wall clock comparison of the two simulations shows a speedup of over 
130. The speedup reduces to 70 when the same nonlinear tolerance (2 x 10~ 4 ) is employed 
in both CPU-GPU and CPU-only simulations. These speedup factors are consistent with 
Amdahl's law, as the particle mover represents ~98% of the overall cost of the algorithm in 
the CPU-only implementation. Larger speedups are possible for problem setups where the 
particle cost represents a larger fraction. 

We have retrofitted the mixed-precision, hybrid CPU-GPU implementation with a defect- 
correction algorithm (32j that delivers a true double precision solution. In this version of the 
mixed-precision hybrid algorithm, the nonlinear residual in Newton's method is evaluated in 
double precision, while the Jacobian- vector product are evaluated in single precision. For the 
IAW problem, this hybrid implementation requires only about a third of the GPU particle 
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Figure 6: Long-timescale simulation of the I AW problem, comparing simulations with the particle mover on 
the CPU using double precision and on the GPU using single precision. Panel a,b,c,d are the time history 
of ion kinetic energy, electron kinetic energy, total energy variation every timestep, and normalized total 
momentum, respectively. 
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mover calls to be evaluated in double precision. It is important to note that double-precision 
computations in our GPU target architecture are expensive, not only due to a factor of 8 (or 
2 in some of the newest GPUs) cost increase of double-precision operations, but also because 
current GPU architectures do not yet feature double-precision versions of the fast intrinsic 
functions or atomic operations. Nevertheless, our defect-correction implementation of the 
mixed-precision hybrid algorithm is still able to achieve a speedup of 40, which is 1.6 times 
better than a full double-precision hybrid CPU-GPU computation. 



5. Discussion and Conclusions 



This study has explored the hybrid, mixed-precision implementation of a recently pro- 
posed fully implicit PIC algorithm 12j on a heterogeneous (CPU-GPU) architecture. The 
implicit PIC algorithm is ideally suited for a hybrid implementation owing to the concept of 
particle enslavement, which segregates the particle orbit integration in the nonlinear residual 
evaluation. Accordingly, the particle mover can be farmed off to a GPU, while the rest of 
the nonlinear solver machinery remains on the CPU. 

With the aid of the roofline model, we have demonstrated that the implicit adaptive, 
charge-conserving particle mover algorithm of Ref. jl2| is compute-bounded, unlike memory- 
bounded explicit particle movers [Hl-Q, [sll- The optimized parallelization of the particle mover 
on the GPU significantly boosts the performance compared to both the algorithm's original 
CPU implementation and a straightforward GPU implementation. The optimized parti- 
cle mover exploits the powerful floating-point units and fast special-function units on the 
GPU without loss of accuracy. Significant acceleration has been achieved by eliminating the 
very-expensive IEEE divisions and square-roots, and by minimizing the creation of divergent 
branches. We have adopted a novel particle sorting strategy that sorts particles according 
to both their positions and their velocities to improve memory efficiency and load balanc- 
ing. The operational throughput of the optimized particle mover algorithm reaches 300-400 
GOp/s (for VP and VA, respectively), on the Nvidia GTX 580 GPU in single precision and 
with typical (large) implicit timesteps, corresponding to 50-70% intrinsic efficiencies (vs. the 
maximum algorithmic throughput) and 20-25% overall efficiencies (vs. peak throughput). 

Moment accumulations from particles are often quoted as a main bottleneck in many pre- 
vious explicit particle mover algorithms 0, S, 1Q|. In contrast, the implicit mover algorithm 
spends most of the time pushing particles. This is true even for the VA approach, which re- 
quires memory accumulations at every sub-timestep. We find that the atomic accumulations 
in the VA approach take about 30% run time of the mover algorithm. If the VP approach is 
adopted instead, the accumulation overhead becomes almost negligible. For the other parts of 
the mover algorithm, we find that the time spent in particle cell-boundary-stopping (needed 
for exact local charge conservation) is comparable to that in the sub-timestep estimate and 
the Crank-Nicolson mover. 

Significant speedup of the whole implicit PIC algorithm is achieved by the hybrid, mixed- 
precision CPU-GPU implementation (using single precision in the GPU and double precision 
in the CPU) vs. a CPU-only one (using double-precision) on a challenging multiscale test 
problem, the ion acoustic wave. Speedups about 100 are found, which are consistent with 
Amdahl's law. Careful comparison of relevant quantities (local charge, energy, momentum, 
and the electron/ion kinetic energy) shows very good quantitative agreement. Particularly 
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encouraging is the fact that errors in momentum conservation seem unaffected by the mixed- 
precision character of the hybrid implementation. Similarly, JFNK performance seems un- 
affected by the use of single precision particle computations on the GPU, as long as the 
nonlinear tolerance is adjusted accordingly. Overall, the mixed-precision hybrid implemen- 
tation is found to provide a robust enough algorithm for this simulation. 

A mixed-precision CPU-GPU implementation that delivers a true double-precision simu- 
lation capability has also been implemented using a defect-correction approach. The speedup 
(vs. the CPU-only double precision serial simulation) is about 40. This outperforms a full 
double precision CPU-GPU implementation, which results in a speedup of 25. 

This study demonstrates at a proof-of-principle level that hybrid CPU-GPU implemen- 
tations hold much promise for future investigation in the context of fully implicit PIC al- 
gorithms. Future work should focus on extending the approach to multiple dimensions and 
larger domains. With current limitations in GPU shared memory (a Fermi GPU currently 
has at most 48KB of shared memory per SM), larger problem sizes will require the use of 
domain decomposition among GPU nodes. Given that the particle orbit integrator in our 
implicit PIC algorithm may sample a relatively large fraction of the domain, special attention 
will need to be paid to devise strategies to minimize communication overhead. Solutions to 
these issues will be explored in future work. 
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